Make `.` the default value for `value` in HTML/XML extraction details

`.` denotes "the context item" in XPath, which makes it a reasonable and
convenient default, especially when scraping a site in response to an
incoming event.

Akinori MUSHA 9 years ago
parent
commit c0ce33a2d6
1 changed file with 3 additions and 3 deletions

+ 3 - 3
app/models/agents/website_agent.rb

@@ -31,7 +31,7 @@ module Agents
 
       # Scraping HTML and XML
 
-      When parsing HTML or XML, these sub-hashes specify how each extraction should be done.  The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`.  It then evaluates an XPath expression in `value` on each node in the node set, converting the result into string.  Here's an example:
+      When parsing HTML or XML, these sub-hashes specify how each extraction should be done.  The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`.  It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string.  Here's an example:
 
           "extract": {
             "url": { "css": "#comic img", "value": "@src" },
@@ -39,7 +39,7 @@ module Agents
             "body_text": { "css": "div.main", "value": ".//text()" }
           }
 
-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use  ".". 
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use  ".".
 
       You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc.  Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
 
@@ -373,7 +373,7 @@ module Agents
         case nodes
         when Nokogiri::XML::NodeSet
           result = nodes.map { |node|
-            case value = node.xpath(extraction_details['value'])
+            case value = node.xpath(extraction_details['value'] || '.')
             when Float
               # Node#xpath() returns any numeric value as float;
               # convert it to integer as appropriate.
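
The fallback introduced in the last hunk can be illustrated in isolation. The sketch below uses plain Ruby hashes and no HTML parsing; `value_expression` is a hypothetical helper name, not something defined in `website_agent.rb`. It only shows which XPath expression the `||` idiom would select for a given extraction hash, assuming the hash is shaped like the `extract` entries in the documentation above.

```ruby
# Hypothetical helper, for illustration only: picks the XPath expression
# to evaluate for one extraction, using the same `|| '.'` fallback as the diff.
def value_expression(extraction_details)
  extraction_details['value'] || '.'
end

with_value    = { 'css' => '#comic img', 'value' => '@src' }
without_value = { 'css' => 'div.main' }                   # key omitted
explicit_nil  = { 'css' => 'div.main', 'value' => nil }   # e.g. JSON null
empty_string  = { 'css' => 'div.main', 'value' => '' }

puts value_expression(with_value)            # the given expression is kept
puts value_expression(without_value)         # falls back to "."
puts value_expression(explicit_nil)          # nil also falls back to "."
puts value_expression(empty_string).inspect  # "" is truthy in Ruby, so no fallback
```

One consequence of using `||` is that only a missing or `nil` (or `false`) `value` triggers the default; an empty string is passed through unchanged, because empty strings are truthy in Ruby.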